10/8/2020

Agenda

  • Goals of model selection and regularization
  • Subset selection
  • Ridge regression
  • Lasso regression

Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).

Main idea

Model Building Process

When building a regression model remember that simplicity is your friend. Smaller models are easier to interpret and have fewer unknown parameters to be estimated.

Keep in mind that every additional parameter represents a cost!

The first step of every model building exercise is to define the universe of variables that could potentially be used. This task is guided largely by your experience and context-specific knowledge:

  • Think carefully about the problem
  • Consult subject matter research and experts
  • Avoid the mistake of selecting too many variables

Model Building Process

With a set of variables in hand, the goal now is to select the best model. Why not include all the variables?

Big models tend to over-fit and find features that are specific to the data in hand, i.e. not generalizable relationships.

The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn.

We need a strategy to build a model in ways that account for the trade-off between bias and variance: subset selection, shrinkage, dimension reduction.

Introduction

Recall the linear model

\[Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon\] We will here focus on improving the linear model by using some other model-fitting strategy beyond ordinary least squares (OLS).
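As a baseline, here is a minimal sketch of an OLS fit in NumPy on synthetic data (the data-generating coefficients below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 1.0])   # made-up coefficients
y = 1.5 + X @ beta_true + rng.normal(scale=1.0, size=n)

# Stack an intercept column and solve the least squares problem
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)  # [intercept, beta_1, ..., beta_5]
```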

Why move beyond OLS?

  1. Improved prediction accuracy
  2. More interpretable models

Reason 1: prediction accuracy

  • A linear model is generally thought of as a “high bias, low variance” kind of estimator, at least compared with alternatives we will see
  • But if the number of observations \(n\) is not much larger than the number of features \(p\), even OLS can have high estimation variance
  • Extreme case: if \(p > n\), there is not even a unique OLS solution! No option but to do something else
  • Result: overfitting and poor prediction accuracy
  • Potential solution: shrinking or regularizing the coefficient estimates so that they don’t just chase noise in the training data

Reason 2: model interpretability

  • A major source of variance in OLS estimation is the inclusion of unnecessary variables in the regression equation (“unnecessary” meaning \(\beta_j = 0\))
  • This situation is especially pronounced in situations of sparsity: where you have lots of candidate features for predicting \(Y\), only a few of them are actually useful, but you do not know which ones
  • By removing these variables, we get a simpler model that is easier to interpret and communicate

Potential solutions

  • Subset Selection: we identify a subset of the \(p\) predictors that we believe to be related to the response, then fit a model using least squares on the subset of variables
  • Shrinkage: we fit a model involving all \(p\) predictors, but the estimated coefficients are shrunk towards 0 relative to the least squares solutions. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection
  • Dimension Reduction: we project the \(p\) predictors into a \(M\)-dimensional subspace, where \(M < p\). This is achieved by computing \(M\) different linear combinations, or projections of the variables. Then these \(M\) projections are used as predictors to fit a linear regression model by least squares
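As a quick preview of the second and third strategies (both return later in these slides), here is a sketch with scikit-learn on synthetic data; the penalty `alpha` values and the number of components are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# Shrinkage: keep all p predictors, but pull the coefficients toward 0
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)   # the lasso can set some coefficients exactly to 0

# Dimension reduction: regress y on M < p linear combinations (principal components)
pcr = make_pipeline(PCA(n_components=2), LinearRegression()).fit(X, y)
print(ridge.coef_, lasso.coef_, sep="\n")
```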

Subset Selection

Main methods

There are 3 options:

  • Best subset selection: fit all possible models and compare their performance
  • Forward selection: starting from the null model, adds one variable at a time until no remaining variable makes a significant contribution (or meets a certain criterion)
  • Backwards selection: starting with the full model, removes one variable at a time until further deletions would do more harm than good

The last two options are known as “stepwise regression”: they build a regression model step-by-step and they are computationally efficient.

Best subset selection

The idea is very simple: fit all possible models and compare their performance based on some criterion (e.g., measure the generalization error of each one on a test set).

Issues?

  • How many possible models? Total number of models \(= 2^p\). This quickly becomes computationally prohibitive (see the quick count after this list)
  • What criteria to use? Just as before, if prediction is what we have in mind, out-of-sample predictive ability should be the criterion
  • This makes things even worse: it will take even longer to repeatedly re-fit each model to multiple training sets. Why?
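To get a feel for the combinatorics in the first bullet, here is a quick count of candidate models (the stepwise count \(1 + p(p+1)/2\) is derived a few slides below):

```python
# Number of candidate models as a function of p
for p in (10, 20, 30):
    best_subset = 2 ** p
    stepwise = 1 + p * (p + 1) // 2
    print(f"p={p}: best subset = {best_subset:,} models, stepwise = {stepwise:,}")
# Already at p = 20 best subset means over a million fits, and cross-validation
# multiplies every one of those fits by the number of folds.
```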

Best subset selection

  1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors, only the intercept. This model simply predicts the sample mean for each observation
  2. For \(k = 1, \dots, p\):
    • Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors
    • Pick the best among these \(\binom{p}{k}\) models, and call it \(\mathcal{M}_k\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\)
  3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\), AIC, BIC or adjusted \(R^2\)
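A minimal sketch of steps 1-2 in NumPy (step 3 is left to an information criterion or cross-validation; the `rss` helper is an illustrative name, not part of any library):

```python
from itertools import combinations
import numpy as np

def rss(X, y, subset):
    """RSS of the OLS fit of y on an intercept plus the columns in `subset`."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def best_subset(X, y):
    """Return, for each size k, the best model M_k as a tuple of column indices."""
    p = X.shape[1]
    models = {0: ()}                                     # M_0: intercept only
    for k in range(1, p + 1):
        models[k] = min(combinations(range(p), k), key=lambda s: rss(X, y, s))
    return models
```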

Best subset selection: example

Forward stepwise selection

  • For computational reasons, best subset selection cannot be applied with large \(p\)
  • Best subset selection may also suffer from statistical problems when \(p\) is large: the larger the search space, the higher the chance of finding models that look good (and overfit) on the training data
  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to the best subset selection
  • Forward stepwise selection begins with an empty model, and then adds predictors to the model, one at a time, until all of the predictors are in the model
  • In particular, at each step the variable that gives the greatest additional improvement (reduction of RSS or increase of \(R^2\)) to the fit is added to the model

Forward stepwise selection

  1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors, only the intercept
  2. For \(k = 0, \dots, p-1\):
    • Consider all \(p-k\) models that augment the predictors in \(\mathcal{M}_k\) with one additional predictor
    • Choose the best among these \(p-k\) models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\)
  3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\), AIC, BIC or adjusted \(R^2\)
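A sketch of the same loop for forward stepwise selection (again only building the path \(\mathcal{M}_0, \dots, \mathcal{M}_p\); step 3 still needs a separate criterion):

```python
import numpy as np

def rss(X, y, subset):
    """RSS of the OLS fit of y on an intercept plus the columns in `subset`."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def forward_stepwise(X, y):
    """Return the path M_0, ..., M_p as tuples of column indices."""
    p = X.shape[1]
    selected, path = [], [()]
    remaining = set(range(p))
    while remaining:
        # Add the single variable that reduces the RSS the most
        best_j = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(tuple(selected))
    return path
```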

Backwards stepwise selection

  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection
  • However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one at a time

Backwards stepwise selection

  1. Let \(\mathcal{M}_p\) denote the full model, which contains all \(p\) predictors
  2. For \(k = p, p-1 \dots, 1\):
    • Consider all \(k\) models that contain all but one of the predictors in \(\mathcal{M}_k\), for a total of \(k-1\) predictors
    • Choose the best among these \(k\) models, and call it \(\mathcal{M}_{k-1}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\)
  3. Select a single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\), AIC, BIC or adjusted \(R^2\)
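The backward version mirrors the forward sketch, dropping at each step the variable whose removal hurts the fit the least (same illustrative `rss` helper as before):

```python
import numpy as np

def rss(X, y, subset):
    """RSS of the OLS fit of y on an intercept plus the columns in `subset`."""
    Xs = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    return np.sum((y - Xs @ beta) ** 2)

def backward_stepwise(X, y):
    """Return the path M_p, ..., M_0 as tuples of column indices."""
    current = list(range(X.shape[1]))        # start from the full model M_p
    path = [tuple(current)]
    while current:
        # Drop the variable whose removal increases the RSS the least
        worst_j = min(current, key=lambda j: rss(X, y, [i for i in current if i != j]))
        current.remove(worst_j)
        path.append(tuple(current))
    return path
```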

Backwards stepwise selection

  • Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p+1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection
  • Like forward stepwise selection, backward selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors
  • Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n < p\), and so is the only viable subset method when \(p\) is very large

Information Criteria

Another way to evaluate a model is to use Information Criteria metrics which attempt to quantify how well our model would have predicted the data (regardless of what you have estimated for the \(\beta_j\)’s).

  • The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error
  • We wish to choose a model with low test error, not low training error. Recall that training error is usually a poor estimate of test error
  • Therefore, RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors

Information Criteria

We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.

  1. \(C_p\)
  2. AIC
  3. BIC
  4. Adjusted \(R^2\)

Bottom line: these are only useful if we lack the ability to compare models based on their out-of-sample predictive ability!

Mallow’s \(C_p\)

  • Mallow’s \(C_p\): \[C_p = \frac{1}{n}(RSS + 2 d \hat{\sigma}^2)\] where \(d\) is the total number of parameters used and \(\hat{\sigma}^2\) is an estimate of the residual variance for the full model.
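A sketch of the formula as a function, with hypothetical numbers just to show how the penalty on \(d\) trades off against RSS (in practice \(\hat{\sigma}^2\) comes from the full-model fit):

```python
def mallows_cp(rss, n, d, sigma2_hat):
    """C_p = (RSS + 2 * d * sigma2_hat) / n, with sigma2_hat from the full model."""
    return (rss + 2 * d * sigma2_hat) / n

# Hypothetical numbers: the bigger model has slightly lower RSS,
# but the penalty on d means the smaller model wins on C_p.
print(mallows_cp(rss=120.0, n=100, d=3, sigma2_hat=1.1))   # ~1.27
print(mallows_cp(rss=118.0, n=100, d=10, sigma2_hat=1.1))  # ~1.40
```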

Akaike Information Criterion

  • AIC: criterion defined for a large class of models fit by maximum likelihood \[AIC = \underbrace{- 2 \log L}_\text{Deviance} + 2d\] where \(L\) is the maximized value of the likelihood function for the estimated model.

  • In the case of a linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and \(C_p\) and AIC are equivalent
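For that Gaussian special case the maximized log-likelihood has a closed form, so AIC can be sketched directly from the RSS (this uses the MLE \(\hat{\sigma}^2 = RSS/n\); software packages differ in which additive constants they keep):

```python
import numpy as np

def aic_gaussian(rss, n, d):
    """AIC = -2 log L + 2 d for a linear model with Gaussian errors.

    Uses the MLE of the error variance, sigma2 = RSS / n, for which
    -2 log L = n * log(2 * pi * sigma2) + n.
    """
    sigma2 = rss / n
    deviance = n * np.log(2 * np.pi * sigma2) + n
    return deviance + 2 * d
```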

Bayesian Information Criterion

  • BIC (Bayes Information Criterion): based on a “Bayesian” philosophy of statistics \[BIC = \frac{1}{n}(RSS + d \hat{\sigma}^2 \log n)\]

  • BIC replaces the \(2 d \hat{\sigma}^2\) used in \(C_p\) with a \(d \hat{\sigma}^2 \log n\) term, where \(n\) is the number of observations

  • Since \(\log n > 2\) for \(n > 7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_p\)
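A quick numeric check of that last bullet, comparing the per-variable penalty weights:

```python
import numpy as np

# Per-variable penalty weight: 2 for C_p/AIC versus log(n) for BIC
for n in (5, 7, 8, 100, 10_000):
    print(f"n = {n:>6}: C_p/AIC weight = 2.00, BIC weight = log(n) = {np.log(n):.2f}")
# Once n > 7, log(n) > 2, so BIC charges more per variable and
# tends to pick smaller models than C_p or AIC.
```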

Adjusted \(R^2\)

Remember \(R^2\), measure of goodness of fit? \[R^2 = 1 - \dfrac{RSS}{TSS}\]

Problem: since \(RSS\) is based on the training set, it can only decrease the more variables we add! Therefore \(R^2\) can only increase for more complex models!

Adjusted \(R^2\)

For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as: \[R^2_{adj} = 1 - \dfrac{RSS/(n-d-1)}{TSS/(n-1)}\]
Unlike \(C_p\), AIC and BIC, here a large value of \(R^2_{adj}\) indicates a model with a small test error.

Maximizing \(R^2_{adj}\) is equivalent to minimizing \(RSS/(n-d-1)\). While \(RSS\) always decreases as the number of variables in the model increases, the \(d\) in the denominator \(n-d-1\) penalizes more complicated models.
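A small sketch of the formula, with hypothetical numbers showing plain \(R^2\) creeping up while \(R^2_{adj}\) drops when the extra variables add little:

```python
def adjusted_r2(rss, tss, n, d):
    """R^2_adj = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical numbers: 5 extra variables shave the RSS only slightly,
# so plain R^2 ticks up while adjusted R^2 goes down.
print(1 - 40.0 / 100.0, adjusted_r2(rss=40.0, tss=100.0, n=50, d=5))    # 0.600 vs ~0.555
print(1 - 39.5 / 100.0, adjusted_r2(rss=39.5, tss=100.0, n=50, d=10))   # 0.605 vs ~0.504
```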

Example

Example

Summary

  • These information criteria are just an approximation to a true out-of-sample estimate. They rely upon some pretty specific mathematical assumptions (linearity of the data, Gaussian error, …)
  • Do not view the stepwise selection algorithm as having superpowers for discerning the single best model
  • It is a good idea to perform a quick train/test split of your data and compute out-of-sample MSE for your final model
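A minimal sketch of that final sanity check with scikit-learn; the data and the chosen column subset `final_cols` are placeholders for whatever your selection procedure produced:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# `final_cols` stands in for whatever subset your selection procedure picked
final_cols = [0, 3]
model = LinearRegression().fit(X_train[:, final_cols], y_train)
mse = mean_squared_error(y_test, model.predict(X_test[:, final_cols]))
print(f"out-of-sample MSE: {mse:.3f}")
```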

Question time